Conversation
cc @inducer
pyop2/base.py (Outdated)

    """
    AVAILABLE_ON_HOST_ONLY = 1
    AVAILABLE_ON_DEVICE_ONLY = 2
    AVAILABLE_ON_BOTH = 3
Maybe IntFlag instead of IntEnum? (just something to perhaps consider...)
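To illustrate the suggestion, a minimal sketch reusing the names from the snippet above (the wrapper class name is hypothetical): with IntFlag, the combined state is the bitwise OR of the other two, and membership tests fall out naturally.

```python
from enum import IntFlag

class Availability(IntFlag):  # hypothetical class name, for illustration only
    AVAILABLE_ON_HOST_ONLY = 1
    AVAILABLE_ON_DEVICE_ONLY = 2
    AVAILABLE_ON_BOTH = AVAILABLE_ON_HOST_ONLY | AVAILABLE_ON_DEVICE_ONLY  # == 3

state = Availability.AVAILABLE_ON_BOTH
assert state & Availability.AVAILABLE_ON_HOST_ONLY  # truthy: host copy is valid
```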
wence- left a comment

Didn't manage to get through everything, but here are a bunch of comments/queries.
pyop2/transforms/snpt.py (Outdated)

    return arg


    def snpt_transform(kernel, block_size):
Renamed it to split_n_across_workgroups, better?
pyop2/backends/cuda.py (Outdated)

    if self.can_be_represented_as_petscvec():
        # report to petsc that we are altering data on the CPU
        with self.vec as petscvec:
            petscvec.array_w
Again, comments about vec state management apply here and, mutatis mutandis, to the rest of the PR.
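For reference, a minimal sketch (standard petsc4py accessors, not PR code) of the state-management distinction at play: `array_w` bumps the Vec's internal state counter so PETSc notices the host-side mutation, while `array_r` leaves it alone.

```python
import numpy as np
from petsc4py import PETSc

vec = PETSc.Vec().createSeq(4)

ro = vec.array_r          # read-only view; state counter untouched
rw = vec.array_w          # writable view; marks the Vec as modified on the CPU
rw[...] = np.arange(4.0)
```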
pyop2/backends/cuda.py (Outdated)

    grid = tuple(int(evaluate(glens[i], parameters)) if i < len(glens) else 1
                 for i in range(2))
    block = tuple(int(evaluate(llens[i], parameters)) if i < len(llens) else 1
                  for i in range(3))
This cached data is generated by a computer. Therefore there should be no need to parse and evaluate it, because the on-disk representation can just be the in-memory representation.
> This cached data is generated by a computer

I preferred a human-readable format to keep the representation valid across PyOP2 versions, which might not be the case with pickling. (Happy to serialize in a different format if there's a better alternative.)

> no need to parse and evaluate it

I don't think we can avoid the evaluates, because the grid sizes are typically symbolic expressions of the form `(end - start) // 32 + 1`. The grid sizes are therefore evaluated at runtime during a GlobalKernel.__call__ invocation.
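To make the runtime-evaluation point concrete, a sketch assuming pymbolic's `parse`/`evaluate` helpers, which the quoted code appears to use:

```python
from pymbolic import parse, evaluate

glens = (parse("(end - start) // 32 + 1"),)    # symbolic grid size
parameters = {"start": 0, "end": 100}          # known only at call time

# Mirrors the quoted code: pad missing dimensions with 1.
grid = tuple(int(evaluate(glens[i], parameters)) if i < len(glens) else 1
             for i in range(2))
assert grid == (4, 1)
```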
Pickle version incompatibilities are fine (they would appear when you upgrade Python, and we don't guarantee very much about the portability of on-disk cached data between firedrake-update calls, let alone Python version upgrades).
What you're attempting to do is cache a function that computes these things. So in the code_to_compile case, just write that function, and pickle it.
Although is this a consequence of the on-disk kernel/compiled code not remembering the loopy representation?
> Pickle version incompatibilities are fine

Cool, will change. (Thanks!)

> Although is this a consequence of the on-disk kernel/compiled code not remembering the loopy representation?

Yes, the compiled code stores only the plain CUDA/OpenCL device kernel. Loopy's on-disk caching mechanism works by storing both the device kernel and the host code that ends up calling that device kernel. But adapting PyOP2 to call loopy kernels directly would be a bit disruptive, and I think loopy's cache dir being $XDG_CACHE_DIR has run into some problems in the past.
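For context, a rough sketch (standard loopy/PyOpenCL usage, not PR code) of the invocation style being discussed: loopy's caching covers both the device kernel and the generated host wrapper that launches it.

```python
import numpy as np
import pyopencl as cl
import loopy as lp

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# loopy generates device code *and* a host wrapper; calling the kernel
# object runs the wrapper, which computes launch sizes and enqueues it.
knl = lp.make_kernel(
    "{ [i]: 0 <= i < n }",
    "out[i] = 2 * a[i]")

a = np.arange(16, dtype=np.float64)
evt, (out,) = knl(queue, a=a)  # wrapper infers n and allocates out
```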
> just write that function, and pickle it

Planning to simply generate a Python function corresponding to the expression: inducer/pymbolic#107
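A sketch of that plan, assuming pymbolic's `parse`/`compile` helpers (names here are illustrative): the symbolic grid size is turned into a plain Python callable once, and only that gets cached.

```python
from pymbolic import parse, compile as pmbl_compile

# Symbolic grid size of the usual form, compiled once into a callable.
grid_x = parse("(end - start) // 32 + 1")
f = pmbl_compile(grid_x, ["start", "end"])

assert f(0, 100) == 4  # (100 - 0) // 32 + 1
```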
For my education, how come you chose not to go with the generated invokers?
For PyOpenCL we could borrow the invoker, but for PyCUDA/C-target we don't have an invoker in loopy (yet).
- Also, for the pyopencl invoker, a minor concern is that its disk-caching semantics don't play well with PyOP2's.
- I feel adding an invoker for the PyCUDA target shouldn't be too difficult. Taking a look at it.
There is invoker generation for the C target. I'd expect that putting together CUDA invokers wouldn't be a massive effort based on a copy-paste job of the CL ones. Do you think that might be the right approach?
I am working on integrating the PyCUDA/PyOpenCL invokers. For the C target, reorganizing pyop2/compilation.py is a bigger undertaking and also out of scope for this PR.
pyop2/backends/cuda.py (Outdated)

    .callables_table))
    f.write("(" + ",".join(str(glen) for glen in glens) + ",)")
    f.write("\n")
    f.write("(" + ",".join(str(llen) for llen in llens) + ",)")
This is data to be consumed by a computer, so don't make it human-readable and require parsing.
Do you recommend pickling?
pyop2/backends/cuda.py (Outdated)

    fpath = self.extra_args_cache_file_path
    extra_args_np = [arg
                     for _, arg in natsorted(numpy.load(fpath).items(),
                                             key=lambda x: x[0])]
This is tremendously fragile. Please think of a better way of saving this data in a way that doesn't rely on dubiously-stable properties of numpy.savez/load.
Agreed. While np.savez-ing the arrays, I also stored a string array that records the keys used for storing these arrays. See https://github.com/OP2/PyOP2/compare/9b7eb97878f2b6609d64affe896c5d4d901a2934..0e58f08d3ea1487e01c45df2270fa9e5790a845d.
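A minimal sketch of that approach (helper names are hypothetical): saving an explicit key array alongside the data makes the load order independent of how numpy names or sorts the entries in the .npz.

```python
import numpy as np

def save_extra_args(fpath, extra_args):
    keys = [f"arg_{i}" for i in range(len(extra_args))]
    # Store the key order explicitly next to the arrays themselves.
    np.savez(fpath, _keys=np.array(keys), **dict(zip(keys, extra_args)))

def load_extra_args(fpath):
    with np.load(fpath) as npz:
        # Rebuild the list in exactly the order it was saved.
        return [npz[str(k)] for k in npz["_keys"]]
```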
@kaushikcfd just to make you aware of a couple things:
Hi Connor!
Let me know if anything in this patch isn't clear.
@kaushikcfd I just read the GPU chapter of your thesis (and liked it!). Looking through the code here I can't find the second transform that you describe, nor the various heuristics. Are those available anywhere?
You can find them here: https://github.com/OP2/PyOP2/tree/auto_tiling.
Ah great. Thanks!
Draft because: